1.1 I am rather excited to start this course, although I am a bit worried that the workload will be too much for me. I hope, though, that if I manage the workload, I will have the skills to start developing independently as a user of R and a contributor to open data science. It is especially the latter point that brought me to this course, although I did not actively look for it; I simply happened to run into it on the course page of the University of Helsinki.
1.2 In what has been a rather unpleasant experience, GitHub did not immediately work for me: nothing seemed to upload to my diary. I think I have now managed to overcome my issues and get things to work most of the time. Learning by doing, I suppose!
To summarize:
- I want to learn open data science.
- I am worried the workload will be too much.
- I look forward to the course.
- I hope I will figure out the specifics of Git and RStudio.
Here’s the link to my GitHub repository.
Libraries:
library(dplyr)
library(ggplot2)
library(GGally)
This chapter analyses a selection of data from a 2014 survey of students participating in an introductory statistics course in Finland. The survey mapped students’ learning approaches and learning achievements. While the original data contained 183 observations of 60 variables, a more limited dataset of 166 observations of 7 variables will be employed here. These variables are the age and gender of the participants, their points from the course (representing their performance), their attitude towards the course, and three variables mapping their learning styles. These learning styles were the “surface approach,” indicating memorization without deeper engagement; the “deep approach,” indicating an intention to maximize understanding of the subject matter; and the “strategic approach,” indicating an approach aimed at maximizing the student’s chance at a good grade. The variables “attitude,” “surface approach,” “deep approach,” and “strategic approach” are all aggregate mean measures of other variables; each summarizes several related questionnaire items into an average. This analysis used the below script, in combination with existing knowledge, to interpret the dataset:
Learn2014 <- read.table("Data/Learn2014", header = TRUE, sep = "\t")
Learn2014$gender <- factor(Learn2014$gender, levels = c("0", "1"), labels = c("0", "1"))
str(Learn2014)
## 'data.frame': 166 obs. of 7 variables:
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ gender: Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 2 ...
## $ attit : num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ strat : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Points: int 25 12 24 10 22 21 21 31 24 26 ...
The below graphs and summaries of the data help us gain an initial picture of the trends present therein. For one, we can see that the vast majority of students participating in this survey were female (110 females vs. 56 males), with a mean age of about 25.5 years and approx. 75% of students below the age of 27.
As for the variables related to studying, all of them approximate a normal distribution, although with a slight skew to the right. Some immediately interesting information arises from the correlation figures. Firstly, a positive attitude is strongly correlated with higher points, while the deep approach, counterintuitively, seems to have little effect on performance. The surface approach, however, seems to predict slightly worse performance, while the strategic approach predicts slightly better performance. Curiously, age among men seems to predict worse performance, although this might be due to two outliers. We shall next test these initial findings with a multiple linear regression.
Graph_AgeGeN <- ggpairs(Learn2014, columns = c(1, 2), legend = 3, title = "Age and Gender", mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
Graph_AgeGenPoints <- ggpairs(Learn2014, columns = c(1,2,7), title = "Effects of Age and Gender on Points", mapping = aes(shape = gender, col = gender), lower = list(combo = wrap("facethist", bins = 20)))
Graph_PredPoints <- ggpairs(Learn2014, columns = c(3:7), title = "Attitude, Study Style and Points", mapping = aes(shape = gender, col = gender), lower = list(combo = wrap("facethist", bins = 20)))
Graph_AgeGeN
Graph_AgeGenPoints
Graph_PredPoints
summary(Learn2014)
## Age gender attit deep surf
## Min. :17.00 0: 56 Min. :1.400 Min. :1.583 Min. :1.583
## 1st Qu.:21.00 1:110 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.417
## Median :22.00 Median :3.200 Median :3.667 Median :2.833
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :2.787
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.167
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :4.333
## strat Points
## Min. :1.250 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:19.00
## Median :3.188 Median :23.00
## Mean :3.121 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:27.75
## Max. :5.000 Max. :33.00
For the below multiple linear regression, three predictor variables have been chosen: attitude, the surface approach, and the strategic approach. These variables were chosen due to their relatively higher correlations compared to the other available variables (age for males is excluded due to the presence of outliers skewing the calculation). The regression shows that only attitude has a statistically significant impact on points, as it is the only independent variable with a p-value below 0.05. In the case of attitude, the probability of observing such a relationship if the null hypothesis (attitude has no effect on points) were true is less than 0.1 percent. Not only is attitude a statistically significant predictor of points, it also seems to have a strong impact, with its beta coefficient being approx. 3.4. This means that with each one-point step towards a better attitude on the Likert scale, points seem to rise by approximately 3.4.
For the remaining variables, the p-value is above 5 percent, which is the conventional cut-off for statistical significance. This interpretation is also supported by the t-values, which are conventionally expected to be larger than 2, or smaller than -2, to indicate statistical significance. Altogether, this model nevertheless only explains approximately 20% of the variation in points, meaning that it is not a very good predictive model.
Points_regression <- lm(Points ~ attit + strat + surf, data = Learn2014)
summary(Points_regression)
##
## Call:
## lm(formula = Points ~ attit + strat + surf, data = Learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attit 3.3952 0.5741 5.913 1.93e-08 ***
## strat 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
To further test the significance of attitude, the Surface Approach and Strategic Approach variables will be removed and a simple linear regression carried out with attitude as the sole predictor. This, however, produced no novel results; with the dropping of variables, the explanatory power (Multiple R-squared) of the model went down from approximately 0.21 to 0.19. This means that changes in students’ attitude can help explain 19% of the variation in students’ scores. The fact that the reduction is so minor is further indication of the minor impact of the Surface Approach and Strategic Approach variables. To experiment a bit, I have also included a multiple linear regression with age added. This, too, had no substantial effect on the model; a slight rise in R-squared is to be expected every time a predictor variable is added.
Points_Attit_Reg <- lm(Points ~ attit, data = Learn2014)
summary(Points_Attit_Reg)
##
## Call:
## lm(formula = Points ~ attit, data = Learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attit 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
AgeAttit_Regression <- lm(Points ~ attit + Age, data = Learn2014)
summary(AgeAttit_Regression)
##
## Call:
## lm(formula = Points ~ attit + Age, data = Learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3354 -3.3095 0.2625 4.0005 10.4911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.57244 2.24943 6.034 1.04e-08 ***
## attit 3.54392 0.56553 6.267 3.17e-09 ***
## Age -0.07813 0.05315 -1.470 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared: 0.2011, Adjusted R-squared: 0.1913
## F-statistic: 20.52 on 2 and 163 DF, p-value: 1.125e-08
To validate the model, this final section will run three plots to test whether the assumptions of a regression model are met by the data. For this validation, the simple linear regression model Points_Attit_Reg will be used, as it is the most efficient of the models produced. The below graphs, “Residuals vs. Fitted,” “Normal Q-Q,” and “Residuals vs. Leverage,” test whether the assumptions of normally distributed, uncorrelated, and constant-variance errors are met.
The Q-Q plot tests whether the errors are normally distributed. The below graph shows that the dots fit reasonably well on the line, although towards the more extreme quantiles the distribution shows signs of being leptokurtic and as such might not be perfectly normal. Nevertheless, this analysis treats the distribution of errors as normal.
The Residuals vs. Fitted graph tests the assumption of constant variance of errors by plotting residuals against predicted values. As we can see no discernible pattern in the data, we can interpret the graph as showing no indication of the size of the error depending on the predicted value. Thus, constant variance of errors is established.
The Residuals vs. Leverage graph shows that none of the data points has an unreasonably high power to pull the model’s predictions towards itself. This means that there are no influential outliers in the dataset. This, in combination with the above tests, indicates that the model is valid, as it adheres to the core assumptions of linear regression.
plot(Points_Attit_Reg, which = c(1,2,5))
Libraries:
library(dplyr)
library(ggplot2)
library(GGally)
library(boot)
2. and 3.
Data Description with Variable Selection and Justification
The below glimpsed dataset, “TheData,” contains the questionnaire answers of 382 students from two Portuguese secondary schools. The answers were given by students attending maths and Portuguese language courses; each group produced its own dataset, and these have here been combined into one. In the process of combining the data, observations have been selected in a manner that ensures that 13 identifying variables contain no empty values. This has resulted in a reduction from 1044 to 382 observations.
The questionnaire was created to predict the target variable “G3,” i.e., the final grade of the student attending the course. Accordingly, the variables can be said to have at least a potential link to school performance, although some variables (such as whether the student lives in an urban or rural area) arguably have a more tenuous theoretical link to school performance than others (such as whether the student receives additional educational support). A glimpse of the data is provided below:
## Rows: 382
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F"…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U"…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "L…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T"…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servi…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other",…
## $ reason <chr> "course", "course", "other", "home", "home", "reputation",…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes"…
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother"…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes"…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "y…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 1…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, …
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0…
## $ alcoholics <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
For the purposes of this analysis, four variables relating to alcohol use have been selected. The primary purpose of this analysis is to examine the effects of alcohol use on the final grade. Accordingly, the variable “G3,” the final grade on a 20-point scale, is a given, as is “alc_use,” the variable mapping alcohol use on a five-point scale where “1” indicates very low consumption and “5” very high consumption (this variable is the mean of the student’s alcohol consumption on weekdays and weekends, mapped by the variables “Dalc” and “Walc,” respectively). The hypothesized relationship between “G3” and “alc_use” is that higher consumption of alcohol predicts lower achievement in school, as represented by G3. Furthermore, it is hypothesized that the mechanism that might explain any potential causal relationship is the number of absences (measured in days out of 93), arising from either the reduced energy or the hangovers caused by heavier drinking (I realize that this is a bold assumption to make before examining the relationship between alcohol use, absences and the final grade, but the task requires naming the four variables now). The benefit of this causal explanation is that it does not require knowledge of the effects of alcohol use on the brain, nor does it, even given such knowledge, demand that the high use be long-term, a common qualifier with alcohol-related learning difficulties, but something for which the dataset contains no data.
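As a minimal sketch, the derived variable described above can be reproduced like this (assuming the combined data frame is named TheData, as in the glimpse):

```r
# alc_use is the mean of weekday (Dalc) and weekend (Walc) consumption
TheData$alc_use <- (TheData$Dalc + TheData$Walc) / 2
```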
As the working theory is that higher alcohol use has a negative effect on school performance, it is also useful to theorize about the reasons behind higher alcohol use. Here two variables are examined: “freetime,” i.e., how much free time the student has in a week on a five-point scale (1 denoting very little, 5 very much), and “famrel,” i.e., how good the student’s relationship is with their family on a five-point scale (1 denoting a very bad relationship, 5 an excellent one). The theorized relationships are as follows: the more free time one has, the more one drinks to pass the time, and the worse one’s relationship with one’s family, the more one drinks for comfort (the same caveat applies here as with the previous relationship). These are the relationships that will be explored below: A) the effects of alcohol use on the final grade; B) the effects of alcohol use on absences and the effects of absences on the grade; C) the effects of free time on alcohol use; D) the effects of family relations on alcohol use. Any further interesting relationships will be explored as warranted by the initial results (such as the effect of family relations, given a lot of free time, on alcohol use).
4.
Numerical and graphical exploration of relationships A through D.
A and B
The above set of graphs explores the relationships between alcohol use, absences and the final grade. The results have been further divided by sex, in the spirit of last week. A few noteworthy points can be noticed immediately. Firstly, overall there seems to be no statistically significant relationship between the number of absences and the final grade. This, if anything, is a troubling result for Portuguese teachers. Admittedly, for males there does seem to be a borderline significant relationship. On the other hand, alcohol use would seem to predict both more absences and lower scores, although here too the difference between males and females is notable.
Since there is no theoretical reason for this division, it raises some questions about the data. As such, before delving further into the numbers, we need to examine the data to see whether the variation between sexes can be explained by abnormalities in the observations. Two observations immediately jump out: in the column where absences are on the y-axis, we can note two observations, both female, that could constitute outliers. To examine this further, we will carry out a regression analysis with absences as the explanatory variable for the final score, and another with alcohol use as the explanatory variable for absences. Both analyses will then be subjected to the Residuals vs. Leverage test from last week, which will indicate whether some of the data points have an unreasonably high power to pull the models’ predictions towards themselves.
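The two models and their leverage diagnostics can be sketched as follows (the formulas match the Call lines in the output below; the object names are my own):

```r
# Absences as the explanatory variable for the final grade
Grade_Abs_Reg <- lm(G3 ~ absences, data = TheData)
# Alcohol use as the explanatory variable for absences
Abs_Alc_Reg <- lm(absences ~ alc_use, data = TheData)
summary(Grade_Abs_Reg)
summary(Abs_Alc_Reg)
# Residuals vs. Leverage plots (diagnostic plot 5), as used last week
plot(Grade_Abs_Reg, which = 5)
plot(Abs_Alc_Reg, which = 5)
```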
##
## Call:
## lm(formula = G3 ~ absences, data = TheData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7235 -1.6055 0.3355 2.3355 6.8688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.72350 0.21879 53.583 <2e-16 ***
## absences -0.05897 0.03094 -1.906 0.0574 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.299 on 380 degrees of freedom
## Multiple R-squared: 0.009471, Adjusted R-squared: 0.006865
## F-statistic: 3.634 on 1 and 380 DF, p-value: 0.05738
##
## Call:
## lm(formula = absences ~ alc_use, data = TheData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.417 -3.442 -1.442 1.576 41.558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2523 0.5918 3.806 0.000165 ***
## alc_use 1.1901 0.2779 4.282 2.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.343 on 380 degrees of freedom
## Multiple R-squared: 0.04603, Adjusted R-squared: 0.04352
## F-statistic: 18.33 on 1 and 380 DF, p-value: 2.349e-05
It seems that the two data points have such high leverage as to bring their validity into question. Of course, in the absence of reasoned proof that they are invalid, they should be left in. In the interest of this exercise, I have nevertheless decided to apply the rule of thumb that observations with a Cook’s distance higher than 4/n (where n is the number of observations) can be removed. Let us see what the end result is after we apply this procedure to the data.
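A sketch of the removal, here applied to the absences model (which model the cut-off was computed from is not stated above, so the details are illustrative; the object names are my own):

```r
# Cook's distances for the alcohol-use-on-absences model
cooks_d <- cooks.distance(lm(absences ~ alc_use, data = TheData))
# Keep only observations below the 4/n rule-of-thumb threshold
TheData_2 <- TheData[cooks_d <= 4 / nrow(TheData), ]
dim(TheData_2)
```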
##
## Call:
## lm(formula = G3 ~ absences, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7846 -1.6377 0.2888 2.2888 6.9496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.78457 0.22435 52.529 <2e-16 ***
## absences -0.07342 0.03461 -2.121 0.0346 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.28 on 376 degrees of freedom
## Multiple R-squared: 0.01183, Adjusted R-squared: 0.009198
## F-statistic: 4.5 on 1 and 376 DF, p-value: 0.03455
##
## Call:
## lm(formula = absences ~ alc_use, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.885 -3.395 -1.397 1.603 41.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4130 0.5349 4.511 8.63e-06 ***
## alc_use 0.9921 0.2533 3.917 0.000107 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.791 on 376 degrees of freedom
## Multiple R-squared: 0.0392, Adjusted R-squared: 0.03664
## F-statistic: 15.34 on 1 and 376 DF, p-value: 0.0001066
The new dimensions:
## [1] 378 35
With the removal of just four observations with a Cook’s distance higher than 4/n, as confirmed by the new dimensions, we can see that absences now function as a statistically significant predictor of academic performance. I would argue that despite the absence of observation-specific reasons supporting the removal, the general expectation that presence in class predicts performance, together with the magnitude of the change in the statistical significance of the results, warrants removing these values. As such, moving onward, this analysis relies on the modified dataset.
Finally, to test these results against the direct effect of alcohol use on the final grade, and the effect of absences given high alcohol use, we will conduct two more regression analyses:
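The two models can be fitted as follows (formulas as in the Call lines of the output below; the object names are my own):

```r
# Alcohol use as a direct predictor of the final grade
Grade_Alc_Reg <- lm(G3 ~ alc_use, data = TheData_2)
# Absences as a predictor of the grade among high alcohol users only
Grade_Abs_HighAlc <- lm(G3 ~ absences, data = filter(TheData_2, alc_use > 3))
summary(Grade_Alc_Reg)
summary(Grade_Abs_HighAlc)
```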
##
## Call:
## lm(formula = G3 ~ alc_use, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4094 -1.8989 0.1011 2.1011 6.3459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3884 0.3645 33.984 < 2e-16 ***
## alc_use -0.4895 0.1726 -2.836 0.00482 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.265 on 376 degrees of freedom
## Multiple R-squared: 0.02094, Adjusted R-squared: 0.01833
## F-statistic: 8.041 on 1 and 376 DF, p-value: 0.004819
##
## Call:
## lm(formula = G3 ~ absences, data = filter(TheData_2, alc_use >
## 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8329 -0.7404 1.1671 1.8142 5.6294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.83286 0.93802 11.549 1.72e-13 ***
## absences -0.04622 0.11495 -0.402 0.69
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.606 on 35 degrees of freedom
## Multiple R-squared: 0.004599, Adjusted R-squared: -0.02384
## F-statistic: 0.1617 on 1 and 35 DF, p-value: 0.69
Looking at the results of the regression analyses, we can see that despite all the work that went into removing the high-Cook’s-distance observations, alcohol use on its own is still a stronger and statistically more significant predictor of poorer academic performance than the number of absences. We also see that the number of absences, given high alcohol use, provides nothing better in terms of predictive power than absences alone. As such, contrary to the originally proposed mechanism, while alcohol use is a statistically significant and strong predictor of absences (one step up the alcohol use scale corresponds to almost one full additional day of absence), absences themselves do not function as a strong predictor of poorer academic performance. In fact, absences only explain approximately half of the variation in the final grade that is explained by alcohol use. We can thus conclude that there is evidence that while alcohol use is associated with poorer performance, it does not act through absences.
C and D
As expected, both poor family relations and free time are statistically significant predictors of alcohol use. We can examine these in more detail with linear regression, as has been done below. We can see that both variables are statistically significant predictors of alcohol use. As for the hypothesized impact of poor family relations given lots of free time, it does not have a larger effect than poor family relations alone. In fact, given lots of free time, poor family relations seem to have a smaller effect, but this difference is not statistically significant:
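The four models can be fitted as follows (formulas as in the Call lines of the output below; the object names are my own):

```r
Alc_Free_Reg <- lm(alc_use ~ freetime, data = TheData_2)
Alc_Fam_Reg <- lm(alc_use ~ famrel, data = TheData_2)
Alc_FamFree_Reg <- lm(alc_use ~ famrel + freetime, data = TheData_2)
# Family relations as a predictor, given lots of free time
Alc_Fam_Free <- lm(alc_use ~ famrel, data = filter(TheData_2, freetime > 3))
```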
##
## Call:
## lm(formula = alc_use ~ freetime, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1822 -0.8364 -0.1822 0.5095 3.1636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.31755 0.16884 7.803 6.02e-14 ***
## freetime 0.17294 0.05015 3.449 0.000627 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9604 on 376 degrees of freedom
## Multiple R-squared: 0.03066, Adjusted R-squared: 0.02808
## F-statistic: 11.89 on 1 and 376 DF, p-value: 0.0006273
##
## Call:
## lm(formula = alc_use ~ famrel, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2930 -0.8469 -0.2576 0.4924 3.2778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.43567 0.22189 10.977 < 2e-16 ***
## famrel -0.14269 0.05497 -2.596 0.00981 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9668 on 376 degrees of freedom
## Multiple R-squared: 0.0176, Adjusted R-squared: 0.01499
## F-statistic: 6.738 on 1 and 376 DF, p-value: 0.009808
##
## Call:
## lm(formula = alc_use ~ famrel + freetime, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3850 -0.7974 -0.2116 0.5067 3.0067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.92585 0.25381 7.588 2.6e-13 ***
## famrel -0.17339 0.05452 -3.180 0.001594 **
## freetime 0.19586 0.05007 3.912 0.000109 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.949 on 375 degrees of freedom
## Multiple R-squared: 0.05612, Adjusted R-squared: 0.05108
## F-statistic: 11.15 on 2 and 375 DF, p-value: 1.982e-05
##
## Call:
## lm(formula = alc_use ~ famrel, data = filter(TheData_2, freetime >
## 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20823 -1.00998 -0.07606 0.85786 2.99002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.34040 0.45699 5.121 9.64e-07 ***
## famrel -0.06608 0.11041 -0.599 0.55
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 143 degrees of freedom
## Multiple R-squared: 0.002499, Adjusted R-squared: -0.004477
## F-statistic: 0.3582 on 1 and 143 DF, p-value: 0.5504
As such, we can conclude section 4 by summarizing that while alcohol use has a negative effect on academic performance, and poor family relations and free time increase alcohol consumption, all of these effects (while statistically significant) are modest if we look at R-squared: alcohol use only explains approximately 2% of the variance in final grades, while poor family relations and free time, even taken together, only explain approximately 5% of the variance in alcohol consumption. As such, while we have some evidence of causal relationships, those relationships are not strong. We can additionally reject the hypothesis that the mechanism by which alcohol consumption affects grades is an increased number of absences.
5.
Logistic Regression of the above variables.
In the above analysis we have treated alcohol use either as an explanatory variable (A-B) or as a target variable (C-D) in a linear model. Here, alcohol use will be defined as a binary variable, where individuals with an alcohol consumption higher than 2 (“low”) will be labeled “alcoholics.” Since alc_use is a mean of two integer scales, this means individuals with an alc_use of 2.5 or higher belong to the category “alcoholics,” while the rest do not. Modeling the other above variables within this framework requires logistic regression, which calculates the probability of an individual belonging to a category (here, alcoholics) based on the model inputs. A probability higher than 0.5 will indicate belonging to the group.
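The dichotomization can be sketched in one line (the column name matches the glimpse in section 2):

```r
# TRUE when average consumption exceeds the "low" level of 2
TheData_2$alcoholics <- TheData_2$alc_use > 2
```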
We will employ all the other variables used above, including absences, since it did have a statistically significant relationship with alcohol use. Consequently, we get the following logistic regression:
##
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences + G3,
## family = "binomial", data = TheData_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6726 -0.8342 -0.6456 1.1785 2.0220
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.56725 0.73233 -0.775 0.438585
## famrel -0.31613 0.12903 -2.450 0.014280 *
## freetime 0.41441 0.12521 3.310 0.000934 ***
## absences 0.07175 0.02371 3.027 0.002473 **
## G3 -0.06959 0.03593 -1.937 0.052781 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.91 on 377 degrees of freedom
## Residual deviance: 424.42 on 373 degrees of freedom
## AIC: 434.42
##
## Number of Fisher Scoring iterations: 4
The Odds Ratios and Their Confidence Intervals
## OR 2.5 % 97.5 %
## (Intercept) 0.5670801 0.1325241 2.3671826
## famrel 0.7289627 0.5649509 0.9387214
## freetime 1.5134784 1.1886426 1.9440711
## absences 1.0743848 1.0268123 1.1277893
## G3 0.9327736 0.8689267 1.0008341
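The odds ratios above can be obtained by exponentiating the model’s coefficients, and the intervals by exponentiating the profile-likelihood confidence intervals (a sketch; the model object name is my own):

```r
Alc_Logit <- glm(alcoholics ~ famrel + freetime + absences + G3,
                 family = "binomial", data = TheData_2)
OR <- exp(coef(Alc_Logit))
CI <- exp(confint(Alc_Logit))
cbind(OR, CI)
```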
In the above summary we can see that the variables used span a wide range of statistical significance. As the commonly accepted cut-offs for statistical significance are a p-value below 0.05, an absolute z-value higher than 2, and a 95% confidence interval for the odds ratio that does not include 1, we can conclude that the variable G3, the final grade, is not statistically significant in our model. As such, we can drop it going forward. (A confidence interval crossing 1 means that the 95% interval contains the odds ratio 1, which would indicate no relationship between the predictor and the target variable.)
The odds ratios support our initial hypothesis. Since odds ratios higher than 1 indicate that the variable is positively correlated with the observation/individual belonging to the group (in this case alcoholics), both free time and absences positively predict belonging to the alcoholics group.
Since better family relations negatively predict belonging to the alcoholics group, we can conclude that the hypothesized positive impact of bad family relations holds.
Nevertheless, as stated above, the impact of these variables is minor, with odds ratios falling close to 1.
(As the final grade is not statistically significant, it has been ignored.)
6.
The below numerical and graphical explorations detail the accuracy of the model without the variable G3. While the plot would seem to indicate a rather random sorting of predictions, a closer examination, carried out by tabulating predictions against the data, shows a more nuanced picture. The model clearly over-predicts non-alcoholics, and when it does predict an alcoholic, there is approximately a 50/50 chance of that prediction being right. Since the majority of cases are non-alcoholics, however, the model’s training error is “only” 0.29, meaning that 29% of the predictions are incorrect. This is better than mere random guessing, or flipping a coin, especially since the alcoholics and non-alcoholics are not split 50/50. Nevertheless, we can see both from the graph and the confusion matrix that the model misses many, many cases where the individual does belong to the group “alcoholics.” As such, it is not a good model.
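A sketch of the steps behind the summary, confusion matrix and training error below (the object names are my own):

```r
# Refit the model without G3
Alc_Logit2 <- glm(alcoholics ~ famrel + freetime + absences,
                  family = "binomial", data = TheData_2)
summary(Alc_Logit2)
# Predicted probabilities, classified at the 0.5 cut-off
prediction <- predict(Alc_Logit2, type = "response") > 0.5
# Confusion matrix: observed classes against predictions
table(alcoholics = TheData_2$alcoholics, prediction = prediction)
# Training error: proportion of incorrect classifications
mean(TheData_2$alcoholics != prediction)
```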
##
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences, family = "binomial",
## data = TheData_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7039 -0.8148 -0.6661 1.2152 1.9476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.33999 0.61829 -2.167 0.030215 *
## famrel -0.32602 0.12889 -2.529 0.011424 *
## freetime 0.41713 0.12437 3.354 0.000797 ***
## absences 0.07582 0.02361 3.211 0.001323 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.91 on 377 degrees of freedom
## Residual deviance: 428.17 on 374 degrees of freedom
## AIC: 436.17
##
## Number of Fisher Scoring iterations: 4
## prediction
## alcoholics FALSE TRUE
## FALSE 255 13
## TRUE 98 12
## [1] 0.2936508
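The confusion matrix and training error above were produced roughly as in the following self-contained sketch. Simulated data stand in for the course data; the 0.5 cut-off matches the one used above.

```r
# Self-contained sketch of the confusion matrix and training error.
# Simulated data; the real analysis uses the model fitted on TheData_2.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

m <- glm(y ~ x, family = "binomial")
prob <- predict(m, type = "response")
pred <- prob > 0.5

table(actual = y, prediction = pred)  # confusion matrix
train_err <- mean(pred != y)          # share of incorrect predictions
train_err
```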
BONUS
By using ten-fold cross-validation, we can split “TheData” into ten folds, train the model on nine of them, and check its accuracy (defined by the ratio of incorrect guesses, as above) against the remaining fold, repeating this for each fold in turn. This is done below. The error of roughly 0.30 indicates that the model performs similarly on held-out data as it did when trained and evaluated on the whole data. It is worse than the one introduced in DataCamp.
## [1] 0.3042328
I was able to find a better model after playing around with the above variables in combination with sex, failures and goout. This has an error rate of 0.24 in ten-fold cross-validation.
## [1] 0.2407407
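The cross-validations above follow the boot::cv.glm pattern used on the course; here is a sketch with simulated data. The cost function counts the share of predictions that land on the wrong side of 0.5.

```r
library(boot)

# Simulated stand-in data; the real run uses TheData and its model.
set.seed(7)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x))

m <- glm(y ~ x, family = "binomial", data = d)

# 0/1 loss: proportion of predictions on the wrong side of 0.5
cost <- function(class, prob) mean(abs(class - prob) > 0.5)

cv <- cv.glm(data = d, cost = cost, glmfit = m, K = 10)
cv$delta[1]  # cross-validated error rate
```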
THE END!
The dataset, “Boston,” used in this analysis can be downloaded with the “MASS” package. As such, it can be seen as a training dataset of sorts. It contains 14 variables with a (potential) connection to housing values in the suburbs of Boston. These variables are:
| Variable | Explanation |
|---|---|
| “crim” | per capita crime rate by town. |
| “zn” | proportion of residential land zoned for lots over 25,000 sq.ft. |
| “indus” | proportion of non-retail business acres per town. |
| “chas” | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| “nox” | nitrogen oxides concentration (parts per 10 million). |
| “rm” | average number of rooms per dwelling. |
| “age” | proportion of owner-occupied units built prior to 1940. |
| “dis” | weighted mean of distances to five Boston employment centres. |
| “rad” | index of accessibility to radial highways. |
| “tax” | full-value property-tax rate per $10,000. |
| “ptratio” | pupil-teacher ratio by town. |
| “black” | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| “lstat” | lower status of the population (percent). |
| “medv” | median value of owner-occupied homes in $1000s. |
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Each variable has 506 observations/points of data, and the computational check below indicates that the data contains no missing values and that each of the 14x506=7084 observations is numeric/integer. Some observations are ratios, some percentages, and at least one (chas) is a dummy variable coded 0/1.
# Count empty (NA) entries across the whole data frame
Empty <- sum(is.na(Boston))

# Count how many of the 14 x 506 entries are numeric or integer
Numer <- sum(sapply(unlist(Boston), is.numeric))
Below, the reader can find the simple bar graphs of each variable, as well as a summary of each examined variable:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
As the general overview above indicates, the data takes various values in various ranges, as one would expect from a dataset containing many different measures. Commenting on all the distributions seems pointless at first sight, but most of the graphs indicate something interesting.
To begin with, the age-graph indicates an aging city. More importantly, its automatic scale seems off. Either there is a high number of areas in the city where the proportion of buildings built before 1940 is close to 100, or the dataset has a typo. Or, as a final thought that seems the most likely: many of the properties surveyed for this dataset come from the same Boston town/area and hence share exactly the same observations for some area-specific variables.
The black-graph appears empty. Yet again, a closer examination (carried out below) shows that the data takes some interesting values. I do not have the knowledge to say what the measure “1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town” should indicate, but approximately 120 of the properties seem to get a value close to 397, while other values occur only once. In fact, the summary statistic presented above shows that the repeated value is probably 396.90. The variable is also curious in that most of its observations fall within the range of 370 to 400, while its smallest observation is 0.32. It might be that the small observation has not been multiplied by 1000 as per the formula, since doing so would bring it close to the expected range. Yet again, this can be a sign of a typo, or something else going on. And as above, the repetition of one value might be explained by many of the properties surveyed for this dataset coming from the same Boston town/area and hence sharing exactly the same observations for some area-specific variables.
The crim-graph also looks empty, but again, the more detailed look below shows that most values range from 0 to 1, as one would expect from a per-capita rate. The fact that the general graph above has a range of 0 to 75 would indicate a typo or some placeholder value. Perhaps certain observations have been multiplied by one hundred to give a percentage, or the person recording the observation has forgotten a decimal point. This would seem to be the case based on the summary statistic, since the max values are far above the median and mean.
The dis-graph looks empty as well, but the closer look below shows the granular level of the observations. With no aggregation, the single lines disappear from the graph when it is extended to contain all values. The summary statistic indicates nothing out of the ordinary.
The final empty-appearing graph, indus, seems to suffer from the same issue as the black-graph. As one can see below, most value counts range between 0 and 10, but around the 17-mark there seems to be one value with a high count, approximately 120 observations. The same issue can also be observed in the tax-, rad-, and ptratio-graphs, although no detailed look is carried out below. This is further evidence that many of the properties surveyed for this dataset might come from the same Boston town/area and hence share exactly the same observations for some area-specific variables.
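The “closer look” referred to above can be done by simply tabulating how often each value repeats; a sketch for the black-variable:

```r
library(MASS)   # provides the Boston dataset
data("Boston")

# How often does each value of "black" occur? A single value dominating
# the counts supports the shared-area hypothesis discussed above.
head(sort(table(Boston$black), decreasing = TRUE))
```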
Finally, the zn-graph looks odd as well, but I would argue that that is just the result of strict zoning laws limiting the number of large lots in most areas (observe the large count at the value 0).
As for the relationships between the variables, the matrix below shows the correlations of each variable paired with each of the others. Of note is the fact that the matrix seemingly indicates that each variable has some statistically significant relationship with each of the other variables. The only exception is the chas-variable, which is also the only dummy variable. An interesting question in this regard is why the chas-variable is the only one that does not have statistically significant correlations with most other variables. The first answer is simple: the fact that a tract bounds the river has no statistically significant impact on many of the other variables. The second option comes down to the inner workings of R: it might be that the cor.mtest-function used here to map p-values does not work well for dummy variables. No mention of this possibility is given by the ?cor.mtest-command.
On the other hand, it is perhaps not surprising that variables that are expected to be significant predictors of housing prices also have statistically significant correlations with one another. A few of these correlations should be highlighted in preparation for the coming phases. The variable “crim” (per-capita crime rate by town) seems to have a strong, statistically significant positive correlation with high property-tax properties, as well as with properties with easier access to radial highways. Property crime is a good explanatory factor for these correlations: high tax- and rad-values indicate high-value targets (the former) and/or easy getaway and access options (the latter).
Higher levels of industry (indus), house age (age), air pollution (nox), pupil-to-teacher ratio (ptratio), and the population’s lower status all have a weaker positive correlation with the crime rate. This perhaps indicates a second category of neighborhoods compared to the above: older, impoverished industrial areas with less access to good education.
Both higher median value and longer distance from employment centers correlate weakly and negatively with a higher crime rate. I have a hard time explaining this. Perhaps it is due to the existence of middle-class suburbs, which are not attractive to property theft due to their distance from a poor city center? This conclusion is perhaps supported by the strong negative correlation between the dis-variable on one hand and the indus-, nox-, and age-variables on the other, which would seem to indicate that the (employment) centers of the city are older industrial neighborhoods. All of this is of course speculative in the absence of clearer information.
Finally, the black-variable seems to be weakly and negatively correlated with a higher crime rate, but as I do not understand the calculations behind the variable, it is rather hard to interpret the (potential) meaning of the correlation. As such, I will drop it going forward.
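The crim-correlations highlighted above can be checked directly from the correlation matrix; a minimal sketch using base R (cor.mtest, used for the p-values, comes from the corrplot package and is omitted here):

```r
library(MASS)
data("Boston")

# Pairwise correlations; pick out the crim-row values discussed above
cor_matrix <- round(cor(Boston), 2)
cor_matrix["crim", c("rad", "tax", "medv", "dis")]
```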
Below the reader can find a summary of the standardized Boston variables. All of them now share a mean of zero, which is by definition a feature of a standardized variable. They are also all on the same scale, which means that they can be compared to one another more easily, although that would not be immediately clear from the data, since the value distributions still retain their curious aspects: for example, with the variable “black,” the min is still far to the left of the rest of the data. Additionally, the standardized binomial variable “chas” has arguably become nonsensical: the old value of 0 has been replaced by -0.2723 and the old value of 1 by 3.6648.
It should also be noted that none of the variables can be fully standardized into standard normal distributions, since they do not adhere to a normal distribution to begin with. This is probably due, at least in part, to the (theorized) over-representation of one neighborhood in the dataset.
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
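The standardization summarized above is a single scale() call; a minimal sketch:

```r
library(MASS)
data("Boston")

# scale() centers each column to mean 0 and divides by its standard
# deviation, putting all variables on a comparable scale.
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$crim)
```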
The reader can also note that in this second set, the crim-variable has been replaced by the categorical Crime-variable, as per the instructions, and the chas-variable has been returned to its original binomial state. Even further down, the reader can finally find the test set with the Crime-variable removed, after the correct answers had been saved.
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
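The categorical Crime-variable is formed by cutting the standardized crim-variable at the quantiles shown above; a self-contained sketch:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Cut the standardized crime rate at its quantiles into four classes
bins  <- quantile(boston_scaled$crim)
Crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("Lowest", "Lower", "Higher", "Highest"))
table(Crime)
```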
## zn indus nox rm
## Min. :-0.48724 Min. :-1.5563 Min. :-1.4644 Min. :-3.8764
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.9121 1st Qu.:-0.5681
## Median :-0.48724 Median :-0.2109 Median :-0.1441 Median :-0.1084
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.: 0.5981 3rd Qu.: 0.4823
## Max. : 3.80047 Max. : 2.4202 Max. : 2.7296 Max. : 3.5515
## age dis rad tax
## Min. :-2.3331 Min. :-1.2658 Min. :-0.9819 Min. :-1.3127
## 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668
## Median : 0.3171 Median :-0.2790 Median :-0.5225 Median :-0.4642
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294
## Max. : 1.1164 Max. : 3.9566 Max. : 1.6596 Max. : 1.7964
## ptratio black lstat medv
## Min. :-2.7047 Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.2746 Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 1.6372 Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
## Crime Boston.chas
## Lowest :127 Min. :0.00000
## Lower :126 1st Qu.:0.00000
## Higher :126 Median :0.00000
## Highest:127 Mean :0.06917
## 3rd Qu.:0.00000
## Max. :1.00000
## zn indus nox rm
## Min. :-0.48724 Min. :-1.4470 Min. :-1.35225 Min. :-3.8764
## 1st Qu.:-0.48724 1st Qu.:-0.7397 1st Qu.:-0.90350 1st Qu.:-0.6143
## Median :-0.48724 Median :-0.0797 Median :-0.14407 Median :-0.2044
## Mean :-0.03724 Mean : 0.1453 Mean : 0.04408 Mean :-0.1773
## 3rd Qu.:-0.48724 3rd Qu.: 1.0150 3rd Qu.: 0.64339 3rd Qu.: 0.2727
## Max. : 3.58609 Max. : 2.1155 Max. : 2.72965 Max. : 2.3318
## age dis rad tax
## Min. :-2.3331 Min. :-1.24464 Min. :-0.9819 Min. :-1.30676
## 1st Qu.:-1.0711 1st Qu.:-0.88200 1st Qu.:-0.6373 1st Qu.:-0.77572
## Median : 0.2407 Median :-0.26385 Median :-0.5225 Median :-0.34851
## Mean :-0.0695 Mean :-0.07914 Mean : 0.0191 Mean : 0.03053
## 3rd Qu.: 0.9476 3rd Qu.: 0.50107 3rd Qu.: 1.6596 3rd Qu.: 1.52941
## Max. : 1.1164 Max. : 3.28405 Max. : 1.6596 Max. : 1.52941
## ptratio black lstat medv
## Min. :-2.5199 Min. :-3.90333 Min. :-1.32936 Min. :-1.9063
## 1st Qu.:-0.3028 1st Qu.: 0.15651 1st Qu.:-0.73561 1st Qu.:-0.6858
## Median : 0.2977 Median : 0.37402 Median :-0.03754 Median :-0.2047
## Mean : 0.1116 Mean :-0.05187 Mean : 0.16201 Mean :-0.1241
## 3rd Qu.: 0.8058 3rd Qu.: 0.43155 3rd Qu.: 0.76451 3rd Qu.: 0.2166
## Max. : 1.2677 Max. : 0.44062 Max. : 3.09715 Max. : 2.9865
## Boston.chas
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.07843
## 3rd Qu.:0.00000
## Max. :1.00000
Despite the fact that none of the variables adhere to the assumption of normal distribution required by LDA, nor is the chas-variable continuous as is usually expected, below the reader can find the required LDA (bi)plot. It contains the categorical crime rate as the target variable and all of the remaining variables as predictors (even the black-variable, despite what I said earlier about not using it). Observing both the biplot and the LDA output, we can see that LD1 explains roughly 95 percent of the between-group variance, while LD2 explains about 4 percent and LD3 about 1 percent.
## Call:
## lda(Crime ~ ., data = TrainSet)
##
## Prior probabilities of groups:
## Lowest Lower Higher Highest
## 0.2475248 0.2549505 0.2500000 0.2475248
##
## Group means:
## zn indus nox rm age dis
## Lowest 0.9975950 -0.96831442 -0.9016927 0.4664355 -0.8962317 0.9464894
## Lower -0.0990564 -0.27533781 -0.5519455 -0.1181898 -0.2676529 0.3383924
## Higher -0.3666748 0.08572049 0.3456871 0.2335388 0.4101022 -0.3573452
## Highest -0.4872402 1.01715195 1.0760911 -0.3997261 0.8285983 -0.8533932
## rad tax ptratio black lstat medv
## Lowest -0.6844182 -0.7461689 -0.4815519 0.3738498 -0.7845841 0.53366730
## Lower -0.5369796 -0.4434751 -0.0216152 0.3185990 -0.1278610 -0.02119605
## Higher -0.4155974 -0.3386129 -0.3864863 0.1003974 -0.0902456 0.28795823
## Highest 1.6377820 1.5138081 0.7803736 -0.7504971 0.8421769 -0.67606132
## Boston.chas
## Lowest 0.0300000
## Lower 0.0776699
## Higher 0.1089109
## Highest 0.0500000
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07249515 0.68407956 -0.78481634
## indus 0.03837037 -0.36601100 0.42523741
## nox 0.39228148 -0.68871879 -1.37968157
## rm -0.12585899 -0.16562528 -0.13259686
## age 0.25295819 -0.36972186 -0.09294929
## dis -0.05219315 -0.19708128 0.08940576
## rad 3.39276845 0.74940773 -0.10015293
## tax 0.01091098 0.32379071 0.44562444
## ptratio 0.08522263 -0.00456574 -0.11142726
## black -0.13769050 0.06220555 0.19116497
## lstat 0.15011407 -0.19580643 0.39596927
## medv 0.15433151 -0.35843871 -0.19300073
## Boston.chas -0.40691713 -0.08872385 0.50525679
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9501 0.0373 0.0126
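The fit and the later prediction step follow the standard MASS::lda workflow; a self-contained sketch (the 80/20 split and the seed here are assumptions mirroring the course setup, not the exact split used above):

```r
library(MASS)
data("Boston")

# Standardize, build the Crime factor, and drop the raw crim variable
b <- as.data.frame(scale(Boston))
b$Crime <- cut(b$crim, breaks = quantile(b$crim), include.lowest = TRUE,
               labels = c("Lowest", "Lower", "Higher", "Highest"))
b$crim <- NULL

# 80/20 train/test split (assumed proportions)
set.seed(1)
idx <- sample(nrow(b), size = floor(0.8 * nrow(b)))
TrainSet <- b[idx, ]
TestSet  <- b[-idx, ]

fit  <- lda(Crime ~ ., data = TrainSet)
pred <- predict(fit, newdata = TestSet)
table(correct = TestSet$Crime, predicted = pred$class)
```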
The table below showcases the cross-tabulated results of the predictions against the actual categories. We can see that the model predicts group membership rather accurately for properties in the higher and highest crime rate areas, while it struggles a bit more in the lower and lowest categories. Indeed, the model seems to slightly over-predict higher crime rates, especially when provided with data on a property in a lower/lowest crime rate area. Nevertheless, it outperforms simple guessing, under which a property would have an almost equal 25% chance of belonging to any of these categories, as indicated by the prior probabilities in the previous section’s model output. As such, a simple random division of the properties into four equal-sized groups would result, on average, in three incorrect predictions per correct prediction. Such odds are much worse than the odds of the model correctly predicting a property belonging to a lowest crime rate area.
## predicted
## correct Lowest Lower Higher Highest
## Lowest 13 11 3 0
## Lower 5 14 4 0
## Higher 0 10 13 2
## Highest 0 0 0 27
The two graphs below showcase the results of the final k-means analysis. The first graph details the change in the total within-cluster sum of squares (WCSS) as we increase the number of clusters from 1 to 14. The aim is to use the graph to find the optimal number of clusters. Since a more granular level will always lead to a smaller WCSS without necessarily being a better grouping device (consider, for example, that the smallest WCSS comes from having only one observation in each “cluster,” meaning that no clustering has been done), we need to find a point after which the WCSS stops dropping drastically, indicating a number of clusters that is significantly more precise than a smaller number, but not significantly less precise than a larger one. The first graph indicates that that point is two (2) clusters.
As for the pairs analysis produced by clustering each pair of variables into two clusters, we will only discuss the top row/first column of the graphs, which relate to the crime rate. This is done to limit the discussion to the relevant aspects rather than covering each of the 182 panels. What we need to keep in mind is that a k-means analysis that clusters into two groups attempts to find two sets of data for which the total (in this case Euclidean) distances to the group means are the smallest. If we were to have a single group that shares many of its observation values, such a group would be expected to repeat itself in each graph. And, indeed, we see most of the crime-graphs maintain a very similar, flat/narrow red-group structure throughout the groupings. To me, this is further evidence that the roughly 120 uncommonly consistently-valued observations that section 2 identified in multiple variables come from a single group of properties from the same area. Perhaps the data showcase something else as well, but hopefully this will suffice. This is already a long text.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.2663 4.6116 4.7275 5.9572 13.8843
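The WCSS curve in the first graph can be reproduced with a short loop over k; a sketch (nstart and the seed are assumptions to stabilize the random initializations):

```r
library(MASS)
data("Boston")
b <- scale(Boston)

# Total within-cluster sum of squares for k = 1..14 clusters
set.seed(123)
wcss <- sapply(1:14, function(k)
  kmeans(b, centers = k, nstart = 10, iter.max = 50)$tot.withinss)

plot(1:14, wcss, type = "b",
     xlab = "Number of clusters", ylab = "Total WCSS")
```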